ABSTRACT: Cryptographic algorithms such as the International Data Encryption Algorithm (IDEA) have found various applications in the secure transmission of data in networked instrumentation and distributed measurement systems. The modulo 2^n + 1 multiplier and squarer play a pivotal role in the implementation of such crypto-algorithms. In this work, an efficient hardware design of IDEA using a novel modulo 2^n + 1 multiplier and squarer as the basic modules is proposed for faster, smaller and low-power IDEA hardware circuits. A novel hardware implementation of the modulo 2^n + 1 multiplier using efficient compressors and sparse-tree-based inverted end-around-carry adders is given. The novel modules are applied to the IDEA algorithm, and the resulting implementation is compared both qualitatively and quantitatively with IDEA implementations that use existing multiplier/squarer designs. Experimental measurement results show that the proposed design is faster and smaller and also consumes less power than similar hardware implementations, making it a viable option for efficient hardware designs.

Yet, despite its sophistication, many later attempts at cracking DES showed significant signs of success. For example, the distributed computing approach of spreading cracking computation power over the Internet earned Rocke Verser and Michael Sanders the prize of the 1997 DES Challenge. DES Challenge II was also cracked the following year. With the invention of the Electronic Frontier Foundation DES Cracker, it was shown that 56-bit key protection is insufficient against exhaustive search employed with today's technology. Therefore, there was an urgent call for a stronger secret-key encryption algorithm. IDEA was one of the algorithms to answer that call.

Key words: modulo 2^n + 1 multiplier; International Data Encryption Algorithm (IDEA); sparse-tree adder; power/area/speed measurement



I. INTRODUCTION 

The demand for high security in communication channels, networked instrumentation and distributed measurement systems is growing rapidly. Confidentiality and security requirements are becoming more and more important to protect the data transmitted and received. This leads to the need for efficient design of cryptographic algorithms which offer data integrity, authentication, non-repudiation and confidentiality of the encrypted data across the communication channels. Various cryptographic algorithms have been studied and implemented to ensure the security of these systems. In this paper, the modulo 2^n + 1 multiplier is the focus because of its important role in the IDEA algorithm. For example, the three major operations that decide the overall performance and delay of IDEA [1, 4, 15] are modulo 2^16 addition, bitwise XOR and modulo 2^16 + 1 multiplication; likewise, GF(2^n) Montgomery multiplication and modular exponentiation can be implemented using repeated multiplication and squaring of the vectors. Among these operations, improving the delay and power efficiency of the modulo 2^n + 1 multiplication leads to a significant increase in the performance of the entire cryptographic cipher.

Numerous hardware implementations of the IDEA algorithm have been proposed in the literature using different modulo 2^16 + 1 multiplier architectures. The IDEA algorithm has been implemented in software [3] on an Intel Pentium II 445 MHz with an encryption rate of 23.53 Mb/sec. Later, IDEA was realized on a hardware chip by Curiger et al. [1] with encryption rates up to 177 Mb/sec. By using a bit-serial implementation [4], which enables IDEA to be fully pipelined, encryption rates reached 500 Mb/sec at a 125 MHz clock rate. The efficiency of the IDEA cipher can still be improved if efficient basic modules such as modulo multipliers and adders are used. An efficient implementation of the modulo 2^n + 1 multiplier based on novel compressors and sparse-tree-based inverted end-around-carry adders is presented in [7]. Even though the architecture of the modulo multiplier proposed in [6] is very efficient, its hardware implementation and optimization are considerably improved in [7]. This results from replacing the full adder arrays with the novel compressors and the final-stage adder with the sparse-tree-based inverted end-around-carry adder.

The paper is organized as follows. Section II-A introduces multiplexer-based compressors. In Section II-B, the hardware implementation of the modulo 2^n + 1 multiplier is given. Section III discusses the proposed implementation of the IDEA cipher, which uses the modulo 2^16 + 1 multiplier. A comparison of our implementation with a recently proposed implementation is made in Section IV. Our conclusions are drawn in Section V.



II. PRELIMINARIES AND REVIEW 

Novel multiplexer (MUX) based compressors and the 2^n + 1 multiplier design have been reported in [7] and are briefly reviewed in this section as follows:
A. Compressors 

1) MUX vs. XOR:

Existing CMOS designs of the 2-1 MUX and the 2-input XOR are shown in Fig. 1. According to [8], the CMOS implementation of the MUX performs better in terms of power and delay compared to the XOR. Suppose X and Y are inputs to the XOR gate; the output is X·Y' + X'·Y, where ' denotes the complement. The same XOR can be implemented using a MUX with inputs X and X' and select bit Y. The efficient implementation of compressors [9] is achieved by using both the output and its complement of these gates. This also reduces the total number of garbage outputs.
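As an illustrative check of the XOR-as-MUX construction, the following Python sketch models a 2:1 multiplexer behaviourally and confirms over all input combinations that feeding it X and its complement, with Y as the select bit, reproduces XOR. The function names `mux2` and `xor_via_mux`, and the convention that select = 0 passes the first data input, are assumptions made for this sketch and are not part of the original design.

```python
def mux2(d0, d1, sel):
    """Behavioural 2:1 MUX: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0


def xor_via_mux(x, y):
    """XOR built from one MUX: data inputs X and NOT X, select bit Y."""
    return mux2(x, 1 - x, y)


# Exhaustive comparison against the sum-of-products form X*Y' + X'*Y.
for x in (0, 1):
    for y in (0, 1):
        sop = (x & (1 - y)) | ((1 - x) & y)
        assert xor_via_mux(x, y) == sop == (x ^ y)
```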



Fig. 1: CMOS implementation of 2-input (a) XOR (b) MUX

2) Description of Compressors:

A (p, 2) compressor with p inputs X1, X2, ..., Xp and two output bits Sum and Carry, along with carry input bits and carry output bits, is governed by the equation:

$$\sum_{i=1}^{p} X_i + \sum_{i=1}^{t} (C_{in})_i = Sum + 2\left(Carry + \sum_{i=1}^{t} (C_{out})_i\right)$$

where t denotes the number of carry input bits and of carry output bits.

The block diagram of a 5:2 compressor is shown in Fig. 2. An efficient design of the existing XOR-based 5:2 compressor [10, 11], which takes 5 inputs and 2 carry inputs, is shown in Fig. 3(a). The critical path delay of this existing compressor is 4Δ-XOR (delay is denoted by Δ).

Fig. 2: Block diagram of a 5:2 compressor
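The counting relation that governs a (p, 2) compressor can be sanity-checked in software. The sketch below, written in Python purely for illustration, models a 5:2 compressor (p = 5, two carry inputs, two carry outputs) as a chain of three full adders and verifies the relation exhaustively. The full-adder chain is an assumed reference structure chosen for clarity; it is not the XOR-based or MUX-based gate-level circuit of Fig. 3, and the names `full_adder` and `compressor_5_2` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    total = a + b + c
    return total & 1, total >> 1


def compressor_5_2(x, cin1, cin2):
    """Reference 5:2 compressor: five inputs x[0..4] and two carry inputs.

    Returns (Sum, Carry, Cout1, Cout2). Cout1 depends only on x[0..2],
    so it never waits on a carry input.
    """
    s1, cout1 = full_adder(x[0], x[1], x[2])
    s2, cout2 = full_adder(s1, x[3], cin1)
    total, carry = full_adder(s2, x[4], cin2)
    return total, carry, cout1, cout2


# Check sum(X_i) + sum(Cin_i) == Sum + 2*(Carry + sum(Cout_i)) for all inputs.
for bits in product((0, 1), repeat=7):
    x, cin1, cin2 = bits[:5], bits[5], bits[6]
    s, carry, cout1, cout2 = compressor_5_2(x, cin1, cin2)
    assert sum(x) + cin1 + cin2 == s + 2 * (carry + cout1 + cout2)
```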
The newly designed compressors use multiplexers in place of XOR gates, resulting in high-speed arithmetic. Also, as shown in Fig. 3(a), in all the existing CMOS implementations of the XOR and MUX gates both the output and its complement are available, but the compressor designs available in the literature do not use these outputs efficiently. In the CMOS implementation of the MUX, if both the select bit and its complement are generated in the previous stage, then its output is generated with much less delay because the switching of the transistor is already completed. Moreover, when both the select bit and its complement are generated in the previous stage, the additional inverter stage is eliminated, which reduces the overall delay in the critical path. The new MUX-based design of the 5:2 compressor [9] is shown in Fig. 3(b); its delay is Δ-XOR + 3Δ-MUX. The CGEN block used in Fig. 3(b) can be obtained from the equation Cout1 = (x1 + x2)·x3 + x1·x2, and its CMOS implementation is given in Fig. 4.
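To make the CGEN equation concrete, the short Python sketch below evaluates Cout1 = (x1 + x2)·x3 + x1·x2 (with + read as OR and · as AND) for every input combination and confirms that it is the majority function of the three inputs, i.e. the carry of a full adder. The function name `cgen` is an assumption for this illustration.

```python
from itertools import product


def cgen(x1, x2, x3):
    """Carry generator: Cout1 = ((x1 OR x2) AND x3) OR (x1 AND x2)."""
    return ((x1 | x2) & x3) | (x1 & x2)


for x1, x2, x3 in product((0, 1), repeat=3):
    # The majority of three bits equals the carry of their arithmetic sum.
    assert cgen(x1, x2, x3) == (x1 + x2 + x3) >> 1
```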



Fig. 3: 5:2 compressors: (a) existing XOR-based design; (b) new MUX-based design
B. Hardware Implementation of the mod 2^n + 1 Multiplier/Squarer

The hardware implementation of the modulo multiplier consists of three modules. The first module generates the partial products, the second module reduces the partial products to two final operands, and the last module adds the Sum and Carry operands from the partial product reduction to get the final result.

1) Partial products generation: The n × n partial products matrix is obtained from the (n + 1)-bit input vectors. This partial product matrix is generated after repositioning the bits of the initial partial product matrix based on several observations presented in [6]. The final partial products matrix after applying all the observations is shown in Fig. 5. The partial product bits can be computed from AND, OR and NOT gates. The most complex function of the partial product generation module is $p_{n-1,n-1} \vee q_{n-2}$, where $p_{i,j} = a_i b_j$ and $q_i = p_{n,i} \vee p_{i,n}$.

Fig. 4: CMOS implementation of carry generator block (CGEN) for the proposed design.

Fig. 5: Final n × n partial product matrix showing the 2^(n-1), 2^(n-2), 2^(n-3), ..., 2^2, 2^1 and 2^0 columns.
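The role of repositioning can be illustrated arithmetically. Because 2^n ≡ -1 (mod 2^n + 1), a partial product bit a_i·b_j whose weight 2^(i+j) reaches or exceeds 2^n can be moved back to column (i + j) mod n with its sign inverted once per wrap. The Python sketch below accumulates the partial products this way and compares the result with direct modular multiplication. It is a behavioural illustration only: the actual matrix of [6] works with complemented bits and a correction factor rather than signed terms, and the function name `mod_mul_by_partial_products` is hypothetical.

```python
def mod_mul_by_partial_products(a, b, n):
    """Multiply a and b modulo 2**n + 1 from individual partial product bits.

    a and b are (n + 1)-bit operands in the range 0 .. 2**n.
    """
    modulus = (1 << n) + 1
    acc = 0
    for i in range(n + 1):
        for j in range(n + 1):
            p_ij = ((a >> i) & 1) & ((b >> j) & 1)  # p_ij = a_i AND b_j
            wraps, column = divmod(i + j, n)
            # Each wrap past 2**n contributes a factor of -1.
            sign = -1 if wraps % 2 else 1
            acc += sign * (p_ij << column)
    return acc % modulus


# Exhaustive check for a small word size (n = 4, modulus 17).
n = 4
for a in range((1 << n) + 1):
    for b in range((1 << n) + 1):
        assert mod_mul_by_partial_products(a, b, n) == (a * b) % ((1 << n) + 1)
```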



2) Partial products reduction: This is the most important module, as it largely determines the critical path delay and the overall performance of the multiplier. Hence this module needs to be designed for minimum delay and low power consumption. The implementations from the literature [5, 6, 13] use full adders (FA) and half adders (HA) to construct this module. The series of full adders in any column can be replaced by the novel compressors that take the same number of inputs. The proposed implementation uses the suggested compressors, which reduces not only the delay and power consumption but also the area of the circuit. For a modulo 2^16 + 1 multiplier in the IDEA cipher, the Carry Save Adder (CSA) array implementation using full adders requires fifteen full adders in series in any column; these fifteen full adders can be replaced by two 7:2 compressors, one 5:2 compressor and two 3:2 compressors.
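The replacement count can be verified with simple arithmetic. A (p, 2) compressor does the work of p - 2 full adders, a relation this section relies on in its correction-factor discussion, so the Python sketch below tallies the full-adder equivalents of the compressors listed above. The helper name `fa_equivalent` is an assumption for this illustration.

```python
def fa_equivalent(p):
    """Number of full adders replaced by one (p, 2) compressor."""
    return p - 2


# Two 7:2 compressors, one 5:2 compressor and two 3:2 compressors.
column = [7, 7, 5, 3, 3]
total = sum(fa_equivalent(p) for p in column)
assert total == 15  # matches the fifteen full adders of the CSA column
```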

Correction factor computation is an important step while generating the partial products matrix. The full adder implementation [6] and the compressor-based implementation [7] result in the same value. Because of space constraints, the computation of the correction factor COR for the full adder implementation [6] is not given in this paper. COR computation for the compressor implementation involves computing only COR2, because COR1 is obtained from the repositioning of the partial product terms, which is the same for both implementations. The COR2 computation for the FA implementation, which has n - 1 stages of addition, is shown in [6]. The COR2 computation for the proposed multiplier implementation using the compressors yields the same result, since any (p, 2) compressor can be built from p - 2 FAs, which give p - 2 carry-outs of weight 2^n. Hence, the overall correction factor COR for the CSA array FA implementation and for the compressor implementation is the same, i.e., 3, as shown in [5].

3) Final stage addition: The partial product reduction module gives one n-bit carry vector and one n-bit sum vector, which need to be added in the final stage addition module. Very efficient parallel prefix adders have been designed for this operation [2].

Suppose S and C are the sum and carry vectors produced by the partial product reduction section. It is shown in the work of Zimmermann [2] that

$$|S + C + 1|_{2^n+1} = |S + C + \overline{c_{out}}|_{2^n} \qquad (1)$$

where $c_{out}$ is the carry output of the n-bit addition S + C.




Fig. 6: Inverted EAC adder implemented using sparse tree 
structure 

Equation (1) can be implemented using an inverted End-Around-Carry (EAC) adder [2, 5, 6]. Even though the propagation delay of this adder is on the order of log2 n, it has the drawback of high interconnect complexity and high fan-out. This can be overcome by a sparse tree adder [12, 16] based on prefix network logic. The sparse tree adder generates the carry for every four bits instead of at every stage and uses a carry select block to select the final carry after the prefix network. This sparse tree adder was shown to be much more efficient in terms of both delay and power than the existing prefix-tree-based adders [14]. Hence the sparse tree can be used to design the inverted End-Around-Carry adder. The newly designed inverted EAC adder using the sparse tree structure is shown in Fig. 6. This inverted EAC adder is used in the final stage addition of the modulo 2^n + 1 multiplier. The proposed implementation of the modulo 2^16 + 1 multiplier for the IDEA cipher is shown in Fig. 7, where R16R15...R2R1R0 represents the final product of the modulo 2^16 + 1 multiplier.
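Equation (1) can be modelled behaviourally as follows. The Python sketch adds two n-bit vectors, inverts the carry out of that addition, feeds it back as a carry in and keeps the low n bits. It ignores the prefix-network and sparse-tree structure entirely and only mirrors the arithmetic; the function name `inverted_eac_add` is an assumption. The check also makes one edge case explicit: when S + C = 2^n - 1 the left-hand side of Equation (1) equals 2^n, whose low n bits are all zero, which is what the n-bit adder returns.

```python
def inverted_eac_add(s, c, n):
    """Behavioural inverted end-around-carry addition of two n-bit vectors."""
    mask = (1 << n) - 1
    total = s + c
    carry_out = total >> n  # 0 or 1 for n-bit operands
    # Re-inject the complemented carry and keep the low n bits.
    return (total + (1 - carry_out)) & mask


# Exhaustive comparison with (S + C + 1) mod (2**n + 1) for n = 4.
n = 4
modulus = (1 << n) + 1
for s in range(1 << n):
    for c in range(1 << n):
        expected = (s + c + 1) % modulus
        # Masking maps the single value 2**n onto its n-bit pattern, zero.
        assert inverted_eac_add(s, c, n) == expected & ((1 << n) - 1)
```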

III. NOVEL IMPLEMENTATION OF IDEA CIRCUIT USING THE PROPOSED MODULO 2^n + 1 MULTIPLIER/SQUARER

The modulo 2^n + 1 computation is an integral part of the International Data Encryption Algorithm (IDEA), where n = 16 [1, 4, 15]. Three major operations that decide the overall delay and performance of the IDEA cipher are:

1) Modulo 2^16 addition;

2) Bitwise XOR;

3) Modulo 2^16 + 1 multiplication/squaring.

As the first two operations take less time and are easy to implement, the delay and power efficiency of the entire IDEA cipher depend significantly on the modulo 2^16 + 1 multiplication/squaring operation. Hence, the IDEA cipher is implemented using the proposed modulo multiplier and compared with the existing implementations.
Fig. 7: Novel implementation of the modulo 2^16 + 1 multiplier using efficient compressors

To encrypt a data block using the IDEA cipher, the data is processed through three modulo multiplication operations in a single round, and the manipulated data then passes through seven more such rounds iteratively and a final output transformation to produce the final encrypted output. The IDEA cipher takes 64-bit input data and produces a 64-bit ciphertext with a 128-bit key. The encryption and decryption algorithms in IDEA are almost identical, except that they utilize two different sets of subkeys generated from the same key by different processes. The IDEA encryption and decryption processes consist of eight rounds of data manipulation using subkeys and a final output transformation stage. In this cipher, all the operations are carried out on 16-bit sub-blocks.

In the encryption process, the 64-bit input data block is divided into 4 sub-blocks of 16 bits each (X1; X2; X3; X4). The 52 subkeys for the encryption process are generated from the original 128-bit key by shifting a part of it. Out of the 52 subkeys, six different subkeys (i.e., Z1(r); Z2(r); Z3(r); Z4(r); Z5(r) and Z6(r), where r is the round number) are used for each round, and the remaining 4 subkeys are used in the final output transformation stage. The 16-bit outputs at each round are represented as Y1(r); Y2(r); Y3(r); Y4(r), and W1; W2; W3; W4 are the outputs of the final output transformation stage. The 52 subkeys used for the decryption process are obtained using a different algorithm [17].

As shown in Fig. 8, the critical path consists of three modulo 2^16 + 1 multiplication operations, two modulo 2^16 addition operations and two 16-bit XOR operations in each round. In the final output transformation stage, the critical path consists of a single modulo 2^16 + 1 multiplication operation. The throughput of the IDEA cipher can be improved if the delay of the modulo 2^16 + 1 multiplication operation is reduced in the pipelined implementation of the IDEA cipher. Fig. 8 shows the datapath of the encryption process of the IDEA cipher and the datapath of a single round with 4 pipeline stages using the proposed modulo multiplier.
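The round structure described in this section can be sketched as a software reference model. The Python code below is not the pipelined hardware datapath of Fig. 8; it is a behavioural sketch that follows the standard published definition of IDEA, which supplies three details not spelled out in this paper and therefore treated here as assumptions: the 16-bit word 0 represents 2^16 in the modulo 2^16 + 1 multiplication, the key schedule rotates the 128-bit key left by 25 bits after every eight subkeys, and the two middle sub-blocks are swapped at the end of each round. All function names are hypothetical.

```python
MASK16 = 0xFFFF
MODULUS = 0x10001  # 2**16 + 1


def mul(a, b):
    """Multiply modulo 2**16 + 1; the 16-bit word 0 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % MODULUS) & MASK16


def add(a, b):
    """Add modulo 2**16."""
    return (a + b) & MASK16


def expand_key(key):
    """Derive 52 16-bit encryption subkeys from a 128-bit integer key."""
    subkeys = []
    while len(subkeys) < 52:
        # Take eight 16-bit words, most significant first.
        for shift in range(112, -1, -16):
            subkeys.append((key >> shift) & MASK16)
        # Rotate the 128-bit key left by 25 bits.
        key = ((key << 25) | (key >> 103)) & ((1 << 128) - 1)
    return subkeys[:52]


def idea_round(x, z):
    """One round on sub-blocks x = (X1..X4) with subkeys z = (Z1..Z6)."""
    x1, x2, x3, x4 = x
    s1, s2 = mul(x1, z[0]), add(x2, z[1])
    s3, s4 = add(x3, z[2]), mul(x4, z[3])
    # Longest path: mul -> XOR -> mul -> add -> mul -> add -> XOR,
    # i.e. three multiplications, two additions and two XORs.
    t1 = mul(s1 ^ s3, z[4])
    t2 = mul(add(s2 ^ s4, t1), z[5])
    t3 = add(t1, t2)
    return (s1 ^ t2, s3 ^ t2, s2 ^ t3, s4 ^ t3)


def encrypt_block(block, key):
    """Encrypt four 16-bit sub-blocks with a 128-bit integer key."""
    z = expand_key(key)
    x = block
    for r in range(8):
        x = idea_round(x, z[6 * r:6 * r + 6])
    x1, x2, x3, x4 = x
    # Output transformation; the middle sub-blocks are swapped back.
    return (mul(x1, z[48]), add(x3, z[49]), add(x2, z[50]), mul(x4, z[51]))


# Hand-checkable first round: X = (0, 1, 2, 3), Z = (1, 2, 3, 4, 5, 6).
assert idea_round((0, 1, 2, 3), (1, 2, 3, 4, 5, 6)) == (0x00F0, 0x00F5, 0x010A, 0x0105)
```

The comment inside `idea_round` traces the same operation count that the text gives for the critical path of a round, which is why shortening the multiplication delay dominates the achievable clock rate of a pipelined design.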

IV. EXPERIMENTAL SIMULATION AND RESULTS

The proposed design of the IDEA cipher with four pipeline stages using the novel modulo 2^n + 1 multipliers is analysed and compared with well-known IDEA cipher implementations. The use of the novel modulo multiplier improves the throughput and performance of the IDEA cipher significantly.



A. Simulation environment 

All the simulations have been carried out using the Mentor Graphics ASIC (Application-Specific IC) design suite. The proposed IDEA cipher design is specified in Verilog HDL, and the multiplier descriptions are mapped onto a 0.18 µm CMOS standard cell library using the Leonardo Spectrum synthesis tool from Mentor Graphics.

Fig. 8: Datapath of IDEA cipher with 4 pipeline stages



The design is optimized for high-speed performance. Netlists generated by the synthesis tool are passed on to a standard place-and-route tool; the layouts are iteratively generated to obtain circuits with minimum area. Power and delay are calculated using the Eldo simulation tool. The proposed experimental simulation has been performed at 1.8 V with all inputs fed at a frequency of 25 MHz.



B. Simulation results 

The IDEA cipher is implemented using both the proposed multiplier and the multipliers presented in [6]. Various performance measurements, including encryption rate, delay and area, for the IDEA cipher using both the proposed multiplier and the existing multiplier are parametrically obtained and listed in Table I. As expected, the proposed IDEA circuit implementation achieves significant improvements in terms of throughput (i.e., encryption rate), latency (i.e., critical path delay) and area (i.e., circuit area).
TABLE I: Comparison of the performance measurements for IDEA cipher

Performance measurement      Using proposed multipliers   Using the multipliers in [6]   % Improvement
Encryption rate (Mb/sec)     460.25                       412.15                         11.25
Critical path delay (ns)     4.372                        5.168                          15.4
Area of the cipher (mm^2)    3.68                         4.22                           12.79



V. CONCLUSION 

A hardware implementation of the IDEA cipher using novel modulo 2^n + 1 multipliers is presented in this paper. It is shown that the proposed modulo 2^n + 1 multiplier improves the performance of the various cryptographic algorithms used in secure communication systems of networked instrumentation and distributed measurement systems. Efficient compressors and sparse-tree-based inverted end-around-carry adders are used to reduce the delay and complexity of the multiplier. Simulations are performed on the known implementation and the proposed implementation. The presented implementation is shown to perform better than the existing one in various aspects (i.e., throughput and critical path delay).

REFERENCES 

[1] R. Zimmermann, A. Curiger, H. Bonnenberg, H. Kaeslin, N. Felber and W. Fichtner, A 177 Mb/s VLSI implementation of the international data encryption algorithm, IEEE J. Solid-State Circuits, 1994, 29, (3), pp. 303-307.

[2] R. Zimmermann, Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication, IEEE Trans. Comput., 2002, 51, pp. 1389-1399.

[3] O. Cheung, K. Tsoi, P. Leong and M. Leong, Tradeoffs in parallel and serial implementations of the international data encryption algorithm IDEA, Lecture Notes in Computer Science, vol. 2162, pp. 333-340, 2001.

[4] M. Leong, O. Cheung, K. Tsoi and P. Leong, A bit-serial implementation of the international data encryption algorithm IDEA, 2000 IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 122-131, 2000.

[5] C. Efstathiou, H. Vergos, S. Dimitrakopoulos and D. Nikolos, Efficient diminished-1 modulo 2^n + 1 multipliers, IEEE Trans. Comput., 2005, 54, pp. 491-496.

[6] H. Vergos and C. Efstathiou, Design of efficient modulo 2^n + 1 multipliers, IET Comput. Digit. Tech., 2007, 1, (1), pp. 49-57.

[7] R. Modugu, N. Park and M. Choi, A Fast Low-Power Modulo 2^n + 1 Multiplier Design, 2009 IEEE International Instrumentation and Measurement Technology Conference, pp. 951-956, May 2009.

[8] R. Zimmermann and W. Fichtner, Low-power logic styles: CMOS versus pass-transistor logic, IEEE J. Solid-State Circuits, vol. 32, pp. 1079-1090, July 1997.

[9] S. Veeramachaneni, L. Avinash, M. Rajashekhar and M. Srinivas, Efficient Modulo (2^k ± 1) Binary to Residue Converters, The 6th International Workshop on System-on-Chip for Real-Time Applications, Dec. 2006, pp. 195-200.

[10] C. Chang, J. Gu and M. Zhang, Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits, IEEE J. Circuits and Systems I, vol. 51, no. 10, pp. 1985-1997, 2004.

[11] M. Rouholamini, O. Kavehie, A. Mirbaha, S. Jasbi and K. Navi, A New Design for 7:2 Compressors, IEEE/ACS International Conference on Computer Systems and Applications (AICCSA '07), 13-16 May 2007, pp. 474-478.

[12] S. Mathew, M. Anders, R. Krishnamurthy and S. Borkar, A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core, IEEE Journal of Solid-State Circuits, vol. 38, no. 5, May 2003, pp. 689-695.

[13] Zhongde Wang, Graham A. Jullien and William C. Miller, An efficient tree architecture for modulo 2^n + 1 multiplication, VLSI Signal Processing, 14(3), pp. 241-248, 1996.

[14] P. Kogge and H. S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Comput., vol. C-22, pp. 786-793, Aug. 1973.

[15] A. Curiger et al., VINCI: VLSI Implementation of the New Secret-key Block Cipher IDEA, Proc. of the Custom Integrated Circuits Conference, San Diego, USA, May 1993.

[16] Yan Sun, Dongyu Zheng, Minxuan Zhang and Shaoqing Li, High Performance Low-Power Sparse-Tree Binary Adders, 8th International Conference on Solid-State and Integrated Circuit Technology, ICSICT 2006.

[17] Yi-Jung Chen, Dyi-Rong Duh and Yunghsian Sam Han, Improved Modulo (2^n + 1) Multiplier for IDEA, J. Inf. Sci. Eng., 23(3), pp. 911-923, 2007.



www.ijmer.com 






